llama : remove all_pos_0, all_pos_1, all_seq_id from llama_batch by ngxson · Pull Request #9745 · ggml-org/llama.cpp

ngxson · 2024-10-04T21:23:31Z

Motivation

While working on the ability to add both embeddings and tokens to the same batch, I noticed that the old API for llama_batch, namely all_pos_0, all_post_1 and all_seq_id has been there for quite a long time.

Migration guide

The recommended way is to use llama_batch_init and llama_batch_free:

llama_batch batch = llama_batch_init(n_tokens, 0, 1); // allocate a batch of n_tokens and one sequence ID
batch.n_tokens = n_tokens;
for (int i = 0; i < n_tokens; i++) {
    batch. token[i] = tokens[i]; // copy token into batch
    batch.   pos[i] = n_past + i; // set correct position for each token
    batch.seq_id[i][0] = 0; // all tokens are in sequence 0
}
batch.logits[n_tokens - 1] = true; // only get logits for last token

if (llama_decode(ctx, batch)) {
    LOG_ERR("%s : failed to eval\n", __func__);
    llama_batch_free(batch); // remember to free the batch before returning
    return false;
}

llama_batch_free(batch);

If the binary is linked against common, you can use some helper functions:

common_batch_add to add a new token into the batch
common_batch_clear to remove all tokens from the batch

If your use case is using single sequence, then you can adapt to the new call signature of llama_batch_get_one (although, this is not recommended):

if (llama_decode(ctx, llama_batch_get_one(tokens.data(), tokens.size()))) {
    LOG_ERR("%s : failed to eval\n", __func__);
    return false;
}

The position of tokens will be tracked automatically by llama_decode. For example, if the first time, you call llama_decode on a batch of 10 tokens, then the next time llama_decode will start decoding from position 11.

I have read the contributing guidelines
Self-reported review complexity:
- Medium

slaren · 2024-10-04T21:35:20Z

I don't see a clear motivation for removing this. I believe that single sequence usage is by far the most common way llama.cpp is used, and removing this function will require most applications to add a lot of boilerplate. We should aim to make the llama.cpp API as simple as possible to use.

ngxson · 2024-10-04T21:54:45Z

My main motivation for this PR is that instead of having an API call solely for keeping backward-compatibility, we could keep it as an utility, not a core API.

Second motivation is thatllama_decode accepts a batch of multiple sequences. It doesn't really care if in the batch there is 1 or many sequences. Therefore, the llama_batch should reflect that.

Keeping these backward-compat struct member makes the code inside llama_sbatch.add_seq_to_ubatch to have 2 different branches that (almost) does the same thing. Instead of maintaining these 2 if-else branches, I believe the better solution would be to generate all the pos, n_seq_id, logits from all_pos_0, all_post_1, all_seq_id. In other words, one can be computed from the other:

(all_pos_0, all_pos_1, all_seq_id) --> (pos, n_seq_id, logits)

ngxson · 2024-10-04T22:07:29Z

I believe that single sequence usage is by far the most common way llama.cpp is used, and removing this function will require most applications to add a lot of boilerplate.

I think in this use case, simple specify n_tokens, token, pos is enough, as most users never use more than one sequence. Look this way, all_pos_0, all_pos_1, all_seq_id are still redundant.

So if we really want to simplify the usage for end user, we could allow user to only set n_tokens, token, pos in llama_batch; The sequence ID can be forced to 0 in this case (as in all examples, llama_batch_get_one is always used with seq_id=0)

Even more simple, pos can be tracked internally by looking at KV cache, so user can simply input a list of tokens.

slaren · 2024-10-04T22:44:39Z

There is a lot we could do to simplify the llama_batch API, I think that the positions could be removed entirely from the public API, but that should wait until a redesign of the API.

all_pos_0, all_pos_1, all_seq_id are not truly redundant when the alternative requires allocating an array of positions, which requires a lot more code than simply calling llama_batch_get_one. If you want to simplify the code that deals with llama_batch internally, a simple solution would be to make a function called at the start of llama_decode that transforms these fields to a list of positions, and sets the pos field of llama_batch. The same could be done to remove all_logits. Then you could remove all the code in llama.cpp that has to deal with all_pos_0, all_pos_1, all_seq_id, single sequence users could continue using llama_batch_get_one without using the common library (which we should never encourage), and it would avoid a breaking API change.

ngxson · 2024-10-05T10:33:49Z

all_pos_0, all_pos_1, all_seq_id are not truly redundant when the alternative requires allocating an array of positions, which requires a lot more code than simply calling llama_batch_get_one.

Let me clarify a bit more, what I mean was that in all examples, we always set:

all_pos_0 = n_past
all_pos_1 = 1
all_seq_id = 0

So I assume that 99% of the case, if user want to work with single-sequence (the most basic usage), then all_pos_1 and all_seq_id are redundant. The all_pos_0, as said earlier, can be tracked internally, so it's a bit redundant for now.

Then you could remove all the code in llama.cpp that has to deal with all_pos_0, all_pos_1, all_seq_id, single sequence users could continue using llama_batch_get_one without using the common library (which we should never encourage), and it would avoid a breaking API change.

The problem with such change is that even without touchingllama_batch_get_one, just removing all_pos_0, all_pos_1, all_seq_id is already a breaking change. That's because the shape of struct llama_batch will be changed.

It seems OK for me to keep llama_batch_get_one in the core library though. One idea is that it can produce a batch with just n_tokens, token being set and pos can be tracked internally. This will cover 99% of basic usage case (single-seq), and for the 1% use case where user still want single-seq but with custom token positions, they now need to construct the batch themself.

In any cases, I still strongly prefer to remove all_pos_0, all_pos_1, all_seq_id altogether, because it's no longer a recommended usage.

slaren · 2024-10-05T12:37:15Z

Sounds goods to me. Other than causing an ABI break, removing all_pos_0, all_pos_1, all_seq_id will probably not break much code. The important API to keep is llama_batch_get_one.

ngxson · 2024-10-11T12:49:05Z

include/llama.h

    // - pos    : the positions of the respective token in the sequence
+    //            (if set to NULL, the token position will be tracked automatically by llama_decode)
    // - seq_id : the sequence to which the respective token belongs
+    //            (if set to NULL, the sequence ID will be assumed to be 0)
    // - logits : if zero, the logits (and/or the embeddings) for the respective token will not be output
+    //            (if set to NULL, only the logits for last token will be returned)
    //


@slaren @ggerganov I updated the behavior of llama_batch to adapt to the removal of all_pos_0, all_pos_1, all_seq_id, please let me know what you think about this implementation. Thank you!

ggerganov · 2024-10-11T14:19:39Z

examples/save-load-state/save-load-state.cpp

        result2 += next_token_str;

-        if (llama_decode(ctx3, llama_batch_get_one(&next_token, 1, n_past, 1))) {
+        if (llama_decode(ctx3, llama_batch_get_one(&next_token, 1))) {


This will generate a batch for seq_id == 0 and it needs to be seq_id == 1

make -j && ./llama-save-load-state -m ${some_model}

Thanks for spotting that! Fixed in 6395174

ggerganov · 2024-10-11T14:46:48Z

examples/perplexity/perplexity.cpp

            const int batch_start = start + j * n_batch;
            const int batch_size  = std::min(end - batch_start, n_batch);

+            llama_batch batch = llama_batch_init(batch_size, 0, 1);


Move the llama_batch outside the loop and reuse it. Maybe utilize the common_batch_ API to make it little less cumbersome.

Fixed in 734f9e2 and 4be7ecf

ggerganov · 2024-10-11T14:53:44Z

src/llama.cpp

+            batch.n_seq_id = n_seq_id.data();
+        }
+        if (!batch.seq_id) {
+            seq_id.resize(batch.n_tokens);


Make this also NULL terminated for consistency (see llama_batch_init):

Suggested change

seq_id.resize(batch.n_tokens);

seq_id.resize(batch.n_tokens + 1);

Fixed in 7264596

ngxson · 2024-10-12T21:43:12Z

examples/infill/infill.cpp


                llama_kv_cache_seq_rm (ctx, 0, params.n_keep + 1            , params.n_keep + n_discard + 1);
-                llama_kv_cache_seq_add(ctx, 0, params.n_keep + 1 + n_discard, n_past, -n_discard);
+                llama_kv_cache_seq_add(ctx, 0, params.n_keep + 1 + n_discard, n_past + 1,    -n_discard);


Small explanation for what's happening: We suppose to shift all tokens from n_keep + n_discard + 1, so the end of must be n_past + 1 (or we can simply set it to -1, which means [p0, inf))

Hm, I don't think n_past + 1 is needed here. There shouldn't be a token with pos == n_past in the KV cache.

But yes, using either n_past or -1 would achieve the same thing. Think using n_past is more illustrative.

ok thanks, I figured out that I counted the token from 1, not from 0. I fixed that in 5d99ae4

…l-org#9745) * refactor llama_batch_get_one * adapt all examples * fix simple.cpp * fix llama_bench * fix * fix context shifting * free batch before return * use common_batch_add, reuse llama_batch in loop * null terminated seq_id list * fix save-load-state example * fix perplexity * correct token pos in llama_batch_allocr

ngxson added the breaking change Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility. label Oct 4, 2024

ngxson requested a review from ggerganov October 4, 2024 21:23

github-actions bot added examples server labels Oct 4, 2024

github-actions bot added the android Issues specific to Android label Oct 4, 2024

ngxson added 2 commits October 11, 2024 11:48

refactor llama_batch_get_one

b226c5b

adapt all examples

1c48616

ngxson force-pushed the xsn/llama_batch_remove_compat branch from 697a3f9 to 1c48616 Compare October 11, 2024 10:11

ngxson added 5 commits October 11, 2024 12:14

Merge branch 'master' into xsn/llama_batch_remove_compat

9970316

fix simple.cpp

9276950

fix llama_bench

59fd6b6

fix

7740c96

fix context shifting

6a9769a

ngxson commented Oct 11, 2024

View reviewed changes

ggerganov reviewed Oct 11, 2024

View reviewed changes

ngxson changed the title ~~llama : move llama_batch_get_one from core library to common~~ llama : remove all_pos_0, all_pos_1, all_seq_id from llama_batch Oct 11, 2024

ggerganov reviewed Oct 11, 2024

View reviewed changes

ngxson added 6 commits October 12, 2024 22:47

free batch before return

0639ff1

Merge branch 'master' into xsn/llama_batch_remove_compat

b4c9911

use common_batch_add, reuse llama_batch in loop

734f9e2

null terminated seq_id list

7264596

fix save-load-state example

6395174

fix perplexity

4be7ecf

ngxson commented Oct 12, 2024

View reviewed changes

ngxson mentioned this pull request Oct 15, 2024

Add PaliGemma Support #7553

Open

ngxson added 2 commits October 18, 2024 15:41

Merge branch 'master' into xsn/llama_batch_remove_compat

9dd7e77

correct token pos in llama_batch_allocr

5d99ae4

ngxson requested a review from ggerganov October 18, 2024 13:57

ggerganov approved these changes Oct 18, 2024

View reviewed changes

ngxson merged commit cda0e4b into ggml-org:master Oct 18, 2024

ngxson mentioned this pull request Oct 19, 2024

changelog : libllama API #9289

Open

slaren mentioned this pull request Oct 19, 2024

Bug: Segmentation fault when running speculative decoding #9949

Open

jakexcosme mentioned this pull request Oct 22, 2025

changelog : libllama API COG-GTM/llama.cpp#246

Open

	seq_id.resize(batch.n_tokens);
	seq_id.resize(batch.n_tokens + 1);

Conversation

ngxson commented Oct 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Migration guide

Uh oh!

slaren commented Oct 4, 2024

Uh oh!

ngxson commented Oct 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ngxson commented Oct 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

slaren commented Oct 4, 2024

Uh oh!

ngxson commented Oct 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

slaren commented Oct 5, 2024

Uh oh!

ngxson Oct 11, 2024

Choose a reason for hiding this comment

Uh oh!

ggerganov Oct 11, 2024

Choose a reason for hiding this comment

Uh oh!

ngxson Oct 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ggerganov Oct 11, 2024

Choose a reason for hiding this comment

Uh oh!

ngxson Oct 12, 2024

Choose a reason for hiding this comment

Uh oh!

ggerganov Oct 11, 2024

Choose a reason for hiding this comment

Uh oh!

ngxson Oct 12, 2024

Choose a reason for hiding this comment

Uh oh!

ngxson Oct 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ggerganov Oct 14, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ngxson Oct 18, 2024

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ngxson commented Oct 4, 2024 •

edited

Loading

ngxson commented Oct 4, 2024 •

edited

Loading

ngxson commented Oct 4, 2024 •

edited

Loading

ngxson commented Oct 5, 2024 •

edited

Loading

ngxson Oct 12, 2024 •

edited

Loading

ngxson Oct 12, 2024 •

edited

Loading

ggerganov Oct 14, 2024 •

edited

Loading